reader: implement parallel CSV reading #2070

Riolku · 2023-09-22T19:45:28Z

This also refactors the CSVReader class to enable this change.

Riolku · 2023-09-22T19:49:15Z

In testing, LDBC-100 loads in 3 seconds on ac4 (with 128 cores). However, this does not include the overhead of counting the number of rows. If we are not going to make hash indexes resizable soon, we should add a dedicated CSV row-counting function.

Furthermore, the 3 seconds does not account for the overhead of writing to disk, since I ran:

LOAD FROM <csv_path> RETURN COUNT(*)

On LDBC-10, which is much nicer to benchmark because the serial CSV reader loads it quickly enough, I got these numbers:

Serial: ~12s.
Parallel: ~350ms.

Again, in practice, we see only a 2x speedup since we pay the price of counting the rows.
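For readers unfamiliar with how a CSV file can be divided among threads, the sketch below shows one common idea: cut the file into fixed-size blocks and let each block own exactly the lines that start inside it. This is an assumption-laden illustration, not the implementation in this PR. The names ByteRange and linesOwnedByBlock are hypothetical, and the sketch ignores newlines inside quoted fields, which a real reader has to handle.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Half-open byte range [begin, end) within the file contents.
struct ByteRange {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical helper: returns the bytes of all lines that start inside
// block `blockIdx`. A line that straddles a block boundary belongs to the
// block where it starts, so a worker may read past its block end.
inline ByteRange linesOwnedByBlock(std::string_view data, std::size_t blockSize,
                                   std::size_t blockIdx) {
    const std::size_t blockBegin = std::min(blockIdx * blockSize, data.size());
    const std::size_t blockEnd = std::min(blockBegin + blockSize, data.size());
    // First offset >= pos at which a line starts (or data.size() if none).
    auto firstLineStartAtOrAfter = [&](std::size_t pos) -> std::size_t {
        if (pos == 0) {
            return 0;
        }
        if (pos >= data.size()) {
            return data.size();
        }
        // A line starts at pos only if the previous byte is '\n'.
        const std::size_t nl = data.find('\n', pos - 1);
        return nl == std::string_view::npos ? data.size() : nl + 1;
    };
    return {firstLineStartAtOrAfter(blockBegin), firstLineStartAtOrAfter(blockEnd)};
}
```

Because the end of one block's range is computed the same way as the start of the next, the ranges tile the file without gaps or overlap, so workers can parse them independently.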

andyfengHKU · 2023-09-25T04:25:23Z

src/include/common/copier_config/copier_config.h

@@ -19,18 +19,21 @@ struct CSVReaderConfig {
    char listBeginChar;
    char listEndChar;
    bool hasHeader;
+    bool parallel;


Not really related to this PR, but I wonder whether we should merge CSVReaderConfig into ReaderConfig at some point. Each reader could then access just the subset of ReaderConfig fields it needs.

Riolku · 2023-09-25

Some attributes can be shared. Parallel definitely should be. Others, though, would not make sense for every reader, right?
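One way to picture the split discussed here is a shared config that carries the common options and an optional CSV-only part. This is purely a hypothetical layout for illustration, not what the codebase does: only listBeginChar, listEndChar, hasHeader and parallel appear in the diff above, while filePath, the default values, and the helper functions are assumptions.

```cpp
#include <optional>
#include <string>

// Hypothetical layout for illustration only; not the PR's actual classes.
// Options that apply only to CSV files (default values are assumed).
struct CSVOptions {
    char listBeginChar = '[';
    char listEndChar = ']';
    bool hasHeader = false;
};

// Options every reader understands, plus an optional CSV-specific part.
struct ReaderConfig {
    std::string filePath;          // assumed field, for illustration
    bool parallel = true;          // shared: any reader may honor it
    std::optional<CSVOptions> csv; // set only when the input is a CSV file
};

// A non-CSV reader looks only at the shared fields and ignores `csv`.
inline bool shouldReadInParallel(const ReaderConfig& config) {
    return config.parallel;
}

// A CSV reader falls back to default CSV options when none were given.
inline CSVOptions csvOptionsOrDefault(const ReaderConfig& config) {
    return config.csv.value_or(CSVOptions{});
}
```

Keeping the format-specific fields in their own struct avoids exposing options such as hasHeader to readers for which they have no meaning.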

codecov · 2023-09-25T16:01:02Z

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (f05d348) 89.55% compared to head (495e283) 89.60%.
Report is 4 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2070      +/-   ##
==========================================
+ Coverage   89.55%   89.60%   +0.04%     
==========================================
  Files         981      985       +4     
  Lines       35901    35745     -156     
==========================================
- Hits        32151    32028     -123     
+ Misses       3750     3717      -33

Files	Coverage Δ
src/binder/bind/bind_file_scan.cpp	`88.05% <100.00%> (+6.80%)`	⬆️
src/binder/bind/bind_reading_clause.cpp	`97.00% <100.00%> (ø)`
src/common/copier_config/copier_config.cpp	`54.54% <100.00%> (-3.79%)`	⬇️
src/include/common/copier_config/copier_config.h	`100.00% <100.00%> (ø)`
...r/operator/persistent/reader/csv/base_csv_reader.h	`100.00% <100.00%> (ø)`
...erator/persistent/reader/csv/parallel_csv_reader.h	`100.00% <100.00%> (ø)`
...operator/persistent/reader/csv/serial_csv_reader.h	`100.00% <100.00%> (ø)`
...e/processor/operator/persistent/reader_functions.h	`100.00% <100.00%> (ø)`
...clude/processor/operator/persistent/reader_state.h	`100.00% <ø> (ø)`
src/processor/operator/persistent/copy_node.cpp	`95.91% <100.00%> (+0.71%)`	⬆️
... and 6 more

... and 12 files with indirect coverage changes

src/include/processor/operator/persistent/reader/csv/base_csv_reader.h

src/include/processor/operator/persistent/reader/csv/parallel_csv_reader.h

src/processor/operator/persistent/reader/csv/base_csv_reader.cpp

src/processor/operator/persistent/reader.cpp

src/include/processor/operator/persistent/reader_functions.h

src/include/processor/operator/persistent/reader/csv/serial_csv_reader.h

src/processor/operator/persistent/reader/csv/parallel_csv_reader.cpp

dataset/csv-edge-case-tests/bom-and-data.csv

ray6080 · 2023-10-06T04:00:15Z

src/processor/operator/persistent/copy_node.cpp

@@ -179,9 +179,10 @@ void CopyNode::checkNonNullConstraint(NullColumnChunk* nullChunk, offset_t numNo
 }

 void CopyNode::finalize(ExecutionContext* context) {
-    auto numNodes = StorageUtils::getStartOffsetOfNodeGroup(sharedState->getCurNodeGroupIdx()) +


Why do we make this change? The change looks like a bug to me. This could just lead to 0 numNodes in statistics.

Riolku force-pushed the parallel-csv branch 3 times, most recently from 00500dc to 1a48115 Compare September 23, 2023 02:46

andyfengHKU reviewed Sep 25, 2023

Riolku force-pushed the parallel-csv branch 3 times, most recently from 0846d9a to 18c7e9f Compare September 25, 2023 15:43

Riolku force-pushed the parallel-csv branch from 18c7e9f to e68d69e Compare September 25, 2023 20:38

Riolku marked this pull request as ready for review September 25, 2023 20:38

Riolku marked this pull request as draft September 25, 2023 20:39

Riolku force-pushed the parallel-csv branch 8 times, most recently from 695fa48 to df729ba Compare September 28, 2023 20:16

Riolku requested a review from acquamarin September 28, 2023 20:38

Riolku marked this pull request as ready for review September 28, 2023 20:42

andyfengHKU reviewed Sep 28, 2023

acquamarin approved these changes Sep 28, 2023

Riolku force-pushed the parallel-csv branch 2 times, most recently from 1cdb9b5 to 33b8ff4 Compare September 28, 2023 21:35

Riolku requested a review from ray6080 September 28, 2023 21:53

reader: implement parallel CSV reading

495e283

This also refactors the CSVReader class to enable this change.

Riolku force-pushed the parallel-csv branch from 33b8ff4 to 495e283 Compare September 29, 2023 03:32

ray6080 merged commit 53d91db into master Sep 29, 2023
10 of 11 checks passed

ray6080 deleted the parallel-csv branch September 29, 2023 05:55

Riolku added a commit that referenced this pull request Sep 29, 2023

reader: remove moved constant from Parallel CSV

4dc0595

Ref #2070.

Riolku mentioned this pull request Sep 29, 2023

reader: remove moved constant from Parallel CSV #2110

Merged

Riolku added a commit that referenced this pull request Sep 29, 2023

reader: remove moved constant from Parallel CSV

8a32704

Ref #2070.

ray6080 reviewed Oct 6, 2023
